
feat(hpc): Fingerprint BindSpace API + VectorWidth config + WHT + BF16 tile GEMM + i2 quantization #109

Merged: AdaWorldAPI merged 2 commits into master from claude/teleport-session-setup-wMZfb on Apr 18, 2026
@AdaWorldAPI
Owner

Summary

Extends ndarray with the hardware + type primitives that lance-graph's
cognitive-shader-driver consumes. Everything in the contract crate depends
on these directly (Fingerprint, VectorWidth, SIMD lane views, BLAS-adjacent
kernels, quantization helpers).

src/hpc/fingerprint.rs (+236 lines)

Full BindSpace-compatible API on Fingerprint<N>:

  • Bit ops: get_bit, set_bit, toggle_bit
  • Algebra: bind (XOR), and, or, not, permute
  • Constructors: random(seed), orthogonal(seed), from_content(&str)
  • Stats: density, hamming (alias)
  • Bundling: bundle(items: &[&Self]) — majority vote
  • SIMD views: chunks_u64x8, chunks_u8x64 — zero-copy lane iteration
  • Width config: VectorWidth enum + LazyLock singleton + vector_config()
    reading NDARRAY_VECTOR_WIDTH env var (production 16K default)
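
As a rough illustration of the algebra listed above, the following minimal sketch shows how bind (XOR), hamming, and a majority-vote bundle could behave on a const-generic fingerprint. The struct here is a hypothetical stand-in, not the PR's code: it assumes `N` counts 64-bit words held in a `words` array, and the method bodies and the strict-majority tie rule are assumptions.

```rust
/// Hypothetical stand-in for the PR's `Fingerprint<N>`; assumes N counts 64-bit words.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Fingerprint<const N: usize> {
    pub words: [u64; N],
}

impl<const N: usize> Fingerprint<N> {
    pub fn zero() -> Self {
        Self { words: [0u64; N] }
    }

    pub fn get_bit(&self, i: usize) -> bool {
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    pub fn set_bit(&mut self, i: usize, value: bool) {
        let mask = 1u64 << (i % 64);
        if value {
            self.words[i / 64] |= mask;
        } else {
            self.words[i / 64] &= !mask;
        }
    }

    /// Bind is word-wise XOR, so binding twice with the same key restores the input.
    pub fn bind(&self, other: &Self) -> Self {
        let mut out = self.clone();
        for (o, w) in out.words.iter_mut().zip(other.words.iter()) {
            *o ^= *w;
        }
        out
    }

    /// Hamming distance: number of differing bits.
    pub fn hamming(&self, other: &Self) -> u32 {
        self.words
            .iter()
            .zip(other.words.iter())
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }

    /// Majority vote: a bit is set when strictly more than half of the items set it
    /// (the tie rule is an assumption of this sketch).
    pub fn bundle(items: &[&Self]) -> Self {
        let mut out = Self::zero();
        for bit in 0..N * 64 {
            let votes = items.iter().filter(|fp| fp.get_bit(bit)).count();
            if votes * 2 > items.len() {
                out.set_bit(bit, true);
            }
        }
        out
    }
}
```

The bit-by-bit bundle loop is written for clarity; a production version would more likely count votes per word or per SIMD lane.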

Six new types are now part of the public surface via simd re-exports.

src/hpc/quantized.rs (+48 lines)

  • quantize_f32_to_i2 / dequantize_i2_to_f32 — 2-bit precision for the
    cascade path
  • dequantize_i8_to_f32 — paired reverse for the existing i8 codec
  • QuantParams public
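
To make the 2-bit path concrete, here is a minimal sketch of one way to quantize f32 values to signed 2-bit codes and back. Everything in it is an assumption for illustration: the function names, the `I2Params` struct (the PR's `QuantParams` may carry different fields), the symmetric max-abs scale, and the four-codes-per-byte packing are not taken from the PR.

```rust
/// Hypothetical parameters; the PR's `QuantParams` may look different.
#[derive(Clone, Copy, Debug)]
pub struct I2Params {
    pub scale: f32,
}

/// Quantize to signed 2-bit codes in {-2, -1, 0, 1}, packed four per byte.
pub fn quantize_i2_sketch(data: &[f32]) -> (Vec<u8>, I2Params) {
    let max_abs = data.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // Map the largest magnitude onto the code -2; fall back to 1.0 for all-zero input.
    let scale = if max_abs > 0.0 { max_abs / 2.0 } else { 1.0 };
    let mut packed = vec![0u8; (data.len() + 3) / 4];
    for (i, &x) in data.iter().enumerate() {
        let q = (x / scale).round().clamp(-2.0, 1.0) as i8;
        let code = (q as u8) & 0b11; // keep the low two two's-complement bits
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    (packed, I2Params { scale })
}

/// Unpack `len` codes and rescale them to f32.
pub fn dequantize_i2_sketch(packed: &[u8], len: usize, params: I2Params) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            // Sign-extend the 2-bit code: 0b10 -> -2, 0b11 -> -1.
            let q = if code >= 2 { code as i8 - 4 } else { code as i8 };
            q as f32 * params.scale
        })
        .collect()
}
```

With only four levels the signed range is asymmetric, so in this sketch the largest positive input clips to one step (`1 * scale`) while the largest negative input is represented exactly.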

src/hpc/fft.rs (+135 lines)

  • wht_f32(&mut [f32]) — Walsh–Hadamard Transform with F32x16 SIMD butterfly
  • wht_f32_new(&[f32]) — functional variant

Used by the cognitive shader's HAD-cascade codec.
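
For reference, the butterfly structure of an in-place Walsh-Hadamard transform can be sketched in scalar code as below. This is not the PR's SIMD implementation: the name `wht_scalar` is hypothetical, the transform is left unnormalized, and whether `wht_f32` applies a scaling factor is not stated in the PR.

```rust
/// In-place, unnormalized Walsh-Hadamard transform (scalar reference sketch).
/// Panics unless `data.len()` is a power of two.
pub fn wht_scalar(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Each pass combines elements that are `h` apart within blocks of size 2h.
        for block in (0..n).step_by(h * 2) {
            for i in block..block + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b;
                data[i + h] = a - b;
            }
        }
        h *= 2;
    }
}
```

Applying this unnormalized transform twice returns the input multiplied by its length, which is a convenient sanity check for any vectorized variant.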

src/hpc/bf16_tile_gemm.rs (+198 lines) + src/hpc/amx_matmul.rs (+44 lines)

  • bf16_tile_gemm — AMX TDPBF16PS primitive with AVX-512 polyfill
  • 16×16 tile matrix multiply for BF16 × BF16 → f32 accumulation
  • Runtime dispatch through simd_caps()
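
The arithmetic that a BF16 × BF16 → f32 tile kernel has to reproduce can be written as a scalar reference like the one below. It is only a sketch under stated assumptions: the function names are hypothetical, tiles are plain row-major 16×16 arrays, f32-to-bf16 conversion truncates rather than rounds, and the pair-interleaved operand layout that the AMX instruction expects is ignored.

```rust
pub const TILE: usize = 16;

/// Truncate an f32 to bf16 by keeping its upper 16 bits (no rounding in this sketch).
pub fn f32_to_bf16(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Widen a bf16 value back to f32 by restoring the dropped low bits as zeros.
pub fn bf16_to_f32(x: u16) -> f32 {
    f32::from_bits((x as u32) << 16)
}

/// C += A * B for one row-major 16x16 tile; inputs are bf16, accumulation is f32.
pub fn bf16_tile_gemm_ref(a: &[u16], b: &[u16], c: &mut [f32]) {
    assert_eq!(a.len(), TILE * TILE);
    assert_eq!(b.len(), TILE * TILE);
    assert_eq!(c.len(), TILE * TILE);
    for i in 0..TILE {
        for j in 0..TILE {
            let mut acc = c[i * TILE + j];
            for k in 0..TILE {
                acc += bf16_to_f32(a[i * TILE + k]) * bf16_to_f32(b[k * TILE + j]);
            }
            c[i * TILE + j] = acc;
        }
    }
}
```

A reference like this is mainly useful as a test oracle: the AMX path and the AVX-512 polyfill can both be compared against it, allowing for small differences from accumulation order.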

src/simd.rs (+36 lines)

Public re-exports for lance-graph consumers:

pub use crate::hpc::fingerprint::{
    Fingerprint, Fingerprint2K, Fingerprint1K, Fingerprint64K,
    VectorWidth, VectorConfig, vector_config,
};
pub use crate::hpc::bnn_cross_plane::CollapseGate;
pub use crate::hpc::bitwise::{hamming_distance_raw, popcount_raw};
pub use crate::hpc::fft::{wht_f32, wht_f32_new};
pub use crate::hpc::quantized::{
    quantize_f32_to_i4, dequantize_i4_to_f32,
    quantize_f32_to_i2, dequantize_i2_to_f32,
    quantize_f32_to_i8, dequantize_i8_to_f32, QuantParams,
};
pub use crate::hpc::cam_pq::{kmeans, squared_l2};
pub use crate::hpc::heel_f64x8::cosine_f32_to_f64_simd;

Consumers write use ndarray::simd::{Fingerprint, VectorWidth, ...};
and never touch internal hpc::* paths.

.claude/knowledge/cognitive-shader-foundation.md (+137 lines)

Agent knowledge doc parallel to lance-graph's. Explains the SIMD floor
(F32x16), the 4-tier dispatch (F32x16 → VNNI2 → AVX512-VNNI → AMX),
the Fingerprint const-generic model, the VectorWidth LazyLock
config path, and which public types lance-graph consumes.
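
The VectorWidth LazyLock config path mentioned above could look roughly like the sketch below. The variant names, the accepted strings for `NDARRAY_VECTOR_WIDTH`, and the helper names are all hypothetical; only the environment variable name, the use of `LazyLock`, and the 16K default come from the PR description. `std::sync::LazyLock` requires Rust 1.80 or newer.

```rust
use std::sync::LazyLock;

/// Hypothetical variants; the PR's `VectorWidth` enum may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VectorWidth {
    W1K,
    W2K,
    W16K,
    W64K,
}

/// Parse an optional setting; anything unrecognized falls back to the 16K default.
/// The accepted strings ("1k", "2k", "64k") are assumptions of this sketch.
pub fn parse_width(raw: Option<&str>) -> VectorWidth {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        Some("1k") => VectorWidth::W1K,
        Some("2k") => VectorWidth::W2K,
        Some("64k") => VectorWidth::W64K,
        _ => VectorWidth::W16K,
    }
}

/// Read the environment once, on first access, and cache the result.
static VECTOR_WIDTH: LazyLock<VectorWidth> = LazyLock::new(|| {
    parse_width(std::env::var("NDARRAY_VECTOR_WIDTH").ok().as_deref())
});

/// Hypothetical accessor, standing in for the PR's `vector_config()`.
pub fn vector_width_sketch() -> VectorWidth {
    *VECTOR_WIDTH
}
```

Keeping the parsing in a pure function separate from the `LazyLock` initializer makes the fallback behavior testable without mutating process-wide environment state.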

.claude/agents/*.md model bumps

Four agents (l3-strategist, migration-tracker, product-engineer,
vector-synthesis) updated to the Opus 4.7 model tag.

Test plan

  • cargo test --lib fingerprint — 21 passing
  • cargo check — clean
  • Existing ndarray tests are unaffected: the 1639 tests filtered out of the
    pattern-matched run belong to other modules, and all pass in a full cargo test run
  • Downstream consumer verified: cargo test -p lance-graph-contract
    and cargo test -p cognitive-shader-driver compile and pass with
    these additions (tested during lance-graph PR #206)

Downstream impact

lance-graph's cognitive-shader-driver and lance-graph-contract both
import from ndarray::simd::* — this PR is what lets their PR #206
compile. Merging unblocks the Tier 0 quick wins (Q2 Cargo.toml pin,
AriGraph wiring, cockpit endpoints).

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

claude added 2 commits April 18, 2026 15:16
QW4: chunks_u64x8() — iterate as 8-word batches for VPOPCNTDQ
     chunks_u8x64() — iterate as 64-byte batches for U8x64 ops
     bundle() — majority vote across multiple fingerprints

These enable the Layer 1 cascade: sweep a fingerprint column via
SIMD-width chunks, then bundle consensus across agents.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

Adds missing BindSpace API methods:
  orthogonal(seed) — golden-ratio-seeded quasi-orthogonal fingerprint
  or() — bitwise OR
  bundle(&[&Self]) — majority vote across multiple fingerprints
  chunks_u64x8() — iterate as 8-word batches for AVX-512 VPOPCNTDQ
  chunks_u8x64() — iterate as 64-byte batches for U8x64 ops

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
@AdaWorldAPI merged commit 4784945 into master on Apr 18, 2026
5 of 14 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c77050c7f

Comment thread src/hpc/fingerprint.rs
Comment on lines +191 to +192
pub fn chunks_u64x8(&self) -> impl Iterator<Item = &[u64]> {
self.words.chunks(8)

P2: Return only full SIMD lanes in chunk iterators

chunks_u64x8 uses .chunks(8), so Fingerprint<N> values where N % 8 != 0 produce a final slice shorter than 8 words; callers following the documented U64x8::from_slice(chunk) usage will panic on that tail chunk. Because Fingerprint is a public const-generic type (and this file already uses small non-multiple test sizes), this API can crash on valid inputs unless it uses chunks_exact(8) (or otherwise handles the remainder explicitly).
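
The difference the review points at can be reproduced on a plain slice. The sketch below uses only the standard library's `chunks_exact` and `remainder`; the helper names are hypothetical and the popcount loop stands in for whatever lane operation a caller would run.

```rust
/// Split a word slice into full 8-word lanes plus the leftover tail.
pub fn full_lanes_and_tail(words: &[u64]) -> (Vec<&[u64]>, &[u64]) {
    let chunks = words.chunks_exact(8);
    let tail = chunks.remainder(); // the 0..=7 words that do not fill a lane
    (chunks.collect(), tail)
}

/// Popcount that processes full lanes first and then handles the tail explicitly.
pub fn popcount_by_lanes(words: &[u64]) -> u32 {
    let chunks = words.chunks_exact(8);
    let tail = chunks.remainder();
    let full: u32 = chunks
        .map(|lane| lane.iter().map(|w| w.count_ones()).sum::<u32>())
        .sum();
    full + tail.iter().map(|w| w.count_ones()).sum::<u32>()
}
```

For a 10-word input, `.chunks(8)` yields slices of length 8 and 2, whereas `chunks_exact(8)` yields one full lane and exposes the remaining 2 words through `remainder()`, so a caller that loads each chunk into an 8-lane register never sees a short slice.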

Comment thread src/hpc/fingerprint.rs
Comment on lines +228 to +229
pub fn orthogonal(seed: u64) -> Self {
Self::random(seed.wrapping_mul(0x9E3779B97F4A7C15))

P2: Prevent orthogonal(0) from collapsing to zero fingerprint

orthogonal forwards the transformed seed into random; when seed == 0, this still passes 0, and the xorshift state remains all-zero, yielding an all-zero fingerprint every time. That breaks the method’s stated quasi-orthogonal behavior and creates a degenerate vector if callers generate seeds starting at 0 (a common indexing pattern), so zero should be remapped/mixed to a non-zero RNG state.
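
The zero-seed collapse and one possible remedy can be sketched as follows. This assumes a standard xorshift64 step (the exact generator inside `random` is not shown in the PR), and `mix_nonzero` is a hypothetical fix for illustration, not the change the author made.

```rust
const GOLDEN: u64 = 0x9E37_79B9_7F4A_7C15;

/// One standard xorshift64 step; zero is a fixed point of this map.
pub fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

/// Seed mixing as quoted in the review: multiplying keeps 0 at 0.
pub fn mix_multiply(seed: u64) -> u64 {
    seed.wrapping_mul(GOLDEN)
}

/// Hypothetical fix: shift seeds by one before mixing so that 0, 1, 2, ... map to
/// distinct non-zero states. Only seed == u64::MAX mixes to zero, and it is remapped
/// to 1; that shares a state with exactly one other seed, which is unavoidable
/// because xorshift64 has only 2^64 - 1 usable states.
pub fn mix_nonzero(seed: u64) -> u64 {
    let mixed = seed.wrapping_add(1).wrapping_mul(GOLDEN);
    if mixed == 0 { 1 } else { mixed }
}
```

Because the golden-ratio constant is odd, the multiplication is a bijection on 64-bit values, so consecutive seeds starting at 0 still receive distinct RNG states under this remapping.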
